1 Introduction

Analysis of differential gene expression has been proved useful in many fields and applications. Related to RNA sequencing, differential expression analysis means taking the normalized read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, the use of statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.

In this report we are going to use different methods to analyze the raw RNA-seq data of some uterine cancer samples.

Uterine cancer is a disease that develops in the tissues of the uterus. The uterus or womb is a secondary sex organ of the female reproductive system in most mammals, including humans, where the fetus develops during gestation. Depending on the specific area where a tumor develops, this kind of cancer can be classified in two types: endometrial cancer forms from the lining of the uterus and uterine sarcoma forms from the muscles or support tissue of the uterus.

The samples used in this report belong to the first, more common and less aggressive type mentioned. Although it can usually be cured, the endometrial carcinogenesis is thought to be caused by a multi-step process that involves the interaction of hormonal regulation, gene mutation, adhesion molecules and apoptosis, and as many other complex types of cancer, it’s molecular pathogenesis is not completely described.

The samples used for the analysis come from a cohort of The Cancer Genome Atlas (TCGA), and we are testing the hypothesis that differential expression between tumor and control samples can reveal specific gene sets involved in the mentioned unknown pathogenesis.

As the input data is in raw RNA read counts, the first part of the project will be a summary and information extraction of the dataset.

2 Summary and information extraction of the data

We start importing the raw table of counts. We can observe 589 patients and 20115 transcripts.

library(SummarizedExperiment)
se <- readRDS(file.path("rawCounts", "seUCEC.rds"))
se ##589 samples // 20115 gene transcripts 
class: RangedSummarizedExperiment 
dim: 20115 589 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(589): TCGA.2E.A9G8.01A.11R.A40A.07
  TCGA.4E.A92E.01A.11R.A37O.07 ... TCGA.FL.A1YV.11A.12R.A32Y.07
  TCGA.FL.A3WE.11A.11R.A22K.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
#se$gender
table (se$gender)

FEMALE 
   556 
table(se$type) #35 normal and 554 tumor.

normal  tumor 
    35    554 
table(se$gender, se$type)
        
         normal tumor
  FEMALE     23   533
sum(is.na(se$gender)) ##number of NA samples in termes of gender 
[1] 33

We can observe the information of the gender from our object data. We exepected to have all female samples but we noticed that we have 33 NA samples, that have to be mantained because are normal ones. Since the number of normal samples is smaller than the number of tumor samples it is important to mantain those normal samples in order to perform the comparison. In the last table we can see a Sex x Type Summarize, and we can see that only 23 samples are cataloged as female and normal.

2.1 Exploration of the data columns

We can observe the phenotypic data information that is related to the patient’s samples. Notice that we are working with a s4 object which is strict, formal and rigurous. We can observe 589 patients and 20115 transcripts. Also there are 549 clinical variables for each of the 589 patients.

dim(colData(se)) #549 clinical variables for each of the 589 patients
[1] 589 549
colData(se)[1:5, 1:5] #example of 5 rows (patients) and 5 columns 
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.2E.A9G8.01A.11R.A40A.07    tumor 9583C10B-B21A-4863-98FA-61E735E64EA5
TCGA.4E.A92E.01A.11R.A37O.07    tumor                                   NA
TCGA.5B.A90C.01A.11R.A37O.07    tumor 16AC4341-CF8F-45E2-B90B-2D12D5F74A59
TCGA.5S.A9Q8.01A.11R.A40A.07    tumor 8751429B-4A11-451E-B978-DC9E9DB0EB36
TCGA.A5.A0G1.01A.11R.A118.07    tumor 53707bb3-426a-43cb-830f-3eeed930295f
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.2E.A9G8.01A.11R.A40A.07        TCGA-2E-A9G8            2014-5-21
TCGA.4E.A92E.01A.11R.A37O.07                  NA                   NA
TCGA.5B.A90C.01A.11R.A37O.07        TCGA-5B-A90C            2014-4-17
TCGA.5S.A9Q8.01A.11R.A40A.07        TCGA-5S-A9Q8           2013-12-23
TCGA.A5.A0G1.01A.11R.A118.07        TCGA-A5-A0G1             2011-1-7
                             prospective_collection
                                           <factor>
TCGA.2E.A9G8.01A.11R.A40A.07                     NO
TCGA.4E.A92E.01A.11R.A37O.07                     NA
TCGA.5B.A90C.01A.11R.A37O.07                     NO
TCGA.5S.A9Q8.01A.11R.A40A.07                    YES
TCGA.A5.A0G1.01A.11R.A118.07                     NO
mcols(colData(se), use.names=TRUE) 
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

These metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be used in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search. The CDE was originally intended to be a standard nomenclature for the reporting of Phase 3 cancer clinical trials data.

2.2 Exploration of the data rows

Now, explore the row (feature) data.

rowData(se) #Each of the genes symbol, their lenght and the gc content.
DataFrame with 20115 rows and 3 columns
               symbol     txlen              txgc
          <character> <integer>         <numeric>
1                A1BG      3322 0.564419024683925
2                 A2M      4844 0.488232865400495
9                NAT1      2280 0.394298245614035
10               NAT2      1322 0.389561270801815
12           SERPINA3      3067 0.524942940984676
...               ...       ...               ...
100996331       POTEB      1706 0.430832356389215
101340251    SNORD124       104 0.490384615384615
101340252   SNORD121B        81 0.407407407407407
102724473      GAGE10       538 0.505576208178439
103091865   BRWD1-IT2      1028 0.592412451361868
#Gets the range of values in each row of a matrix.
rowRanges(se) #The chromosome in which the gene is located, the range, the strand, the symbol, the lenght and the gc content.
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames            ranges strand |      symbol     txlen
               <Rle>         <IRanges>  <Rle> | <character> <integer>
          1    chr19 58345178-58362751      - |        A1BG      3322
          2    chr12   9067664-9116229      - |         A2M      4844
          9     chr8 18170477-18223689      + |        NAT1      2280
         10     chr8 18391245-18401218      + |        NAT2      1322
         12    chr14 94592058-94624646      + |    SERPINA3      3067
        ...      ...               ...    ... .         ...       ...
  100996331    chr15 20835372-21877298      - |       POTEB      1706
  101340251    chr17 40027542-40027645      - |    SNORD124       104
  101340252     chr9 33934296-33934376      - |   SNORD121B        81
  102724473     chrX 49303669-49319844      + |      GAGE10       538
  103091865    chr21 39313935-39314962      + |   BRWD1-IT2      1028
                         txgc
                    <numeric>
          1 0.564419024683925
          2 0.488232865400495
          9 0.394298245614035
         10 0.389561270801815
         12 0.524942940984676
        ...               ...
  100996331 0.430832356389215
  101340251 0.490384615384615
  101340252 0.407407407407407
  102724473 0.505576208178439
  103091865 0.592412451361868
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

The genomic ranges of the transcripts are not the ones that are represented in NCBI database using the hg38 version. Example: A1BG: 58346806..58353499 complement instead of 58345178-58362751 A2M: 9067708..9116229 complement instead of 9067664-9116229

However we consider that is not relevant in relation to our study, because the variance is minimal.

2.3 Exploring the assays before normalizing the data

To retrieve the experiment data from a SummarizedExperiment object one can use the assays() accessor. An object can have multiple assay datasets each of which can be accessed using the $ operator. The airway dataset contains only one assay (counts). Here, each row represents a gene transcript and each column one of the samples.

We can observe a table in which there is a representation of gene transcripts (rows) of each of the samples (columns). A summary of the counts is printed in order to see the tendency of our data.

assays(se)
List of length 1
names(1): counts
assays(se)$counts[1:5, 1:5] #Each row represents a gene transcript and each column one of the samples. 
   TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07
1                            48                           39
2                         13605                        31176
9                             0                            0
10                            0                            0
12                         1356                         1887
   TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07
1                            20                           28
2                         13250                         2876
9                             0                            0
10                            0                            0
12                         1478                          639
   TCGA.A5.A0G1.01A.11R.A118.07
1                            77
2                          7067
9                             0
10                            0
12                          930
countexpr <- assays(se)$counts
dim(countexpr)
[1] 20115   589
#Summary statistics of the sequencing depth that mapped the genes. 
summary(colSums(countexpr))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 3345074 17330822 20439586 24645417 33657247 60097229 

In the previous summary we can see some visualization of our data, including the header of a table where the rows are genes and the colums are samples (so the number would be the reads mapping that gene in each individual) and some main statistics below.

3 Quality assessment and normalization

The main goal of this step is to bring the samples to a level that can be comparable removing the technical differences between them.

We have to take into account 2 steps, within-samples and between-samples normalization:

  • Within-samples: adjustments to compare across features in a sample. It would like to make the reads of the two genes comparable between them. It contemplates the fact that some samples may be sequenced with more depth.
    • Scaling: using counts per million reads (CPM) mapped to the genome.
  • Between-samples: adjustments to compare a feature across samples.
    • Sample-specific normalization factors: using the TMM algorithm from the R/Bioconductor package edgeR: sample specific normalization factor
    • We could also use quantile normalization but is not the best option for our data.

We need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, "results/dge1.rds")
names(dge)
[1] "counts"  "samples" "genes"  
head(dge$samples) #Data frame as many rows as samples. Group column (natural grouping tool). Lib size adds up counts per sample
                             group lib.size norm.factors
TCGA.2E.A9G8.01A.11R.A40A.07     1 34953667            1
TCGA.4E.A92E.01A.11R.A37O.07     1 37504801            1
TCGA.5B.A90C.01A.11R.A37O.07     1 51816834            1
TCGA.5S.A9Q8.01A.11R.A40A.07     1 10067025            1
TCGA.A5.A0G1.01A.11R.A118.07     1 19273702            1
TCGA.A5.A0G2.01A.11R.A16W.07     1 20526431            1
#lib size is read addition in a column
head(colSums(assays(se)$counts)) ##We can confirm it looking at these values
TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07 
                    34953667                     37504801 
TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07 
                    51816834                     10067025 
TCGA.A5.A0G1.01A.11R.A118.07 TCGA.A5.A0G2.01A.11R.A16W.07 
                    19273702                     20526431 
#norm factors (by deafult is 1) 

Now calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.

assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
#Log of 0 does not exist so we use a prior count value that is added in all the values (log between 0 and 1 is negative).
assays(se)$logCPM[1:5, 1:5]
   TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07
1                     0.4787484                   0.08427463
2                     8.6045526                   9.69917999
9                    -5.6232476                  -5.62324755
10                   -5.6232476                  -5.62324755
12                    5.2785238                   5.65345688
   TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07
1                     -1.299515                     1.486274
2                      7.998470                     8.158385
9                     -5.623248                    -5.623248
10                    -5.623248                    -5.623248
12                     4.835107                     5.988568
   TCGA.A5.A0G1.01A.11R.A118.07
1                      2.005532
2                      8.518400
9                     -5.623248
10                    -5.623248
12                     5.593132

3.1 Sequencing depth

A very first basic QA diagnostic of expression profiles in RNA-seq data is to examine the sequencing depth of mapped reads in increasing order with a bar plot. Labeling/coloring bars for each grup of samples can help to identify problems.

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample:

[1] 14516150674
[1] 24645417
Library sizes in increasing order.

Figure 1: Library sizes in increasing order

We have colored the bars using the type condition (tumor/normal) because we want to spot differences of expression between these conditions.

If the tumor samples would be in one site and the other ones in the other, it would be a problem, meaning that the control and tumor samples have been differently measured.

As there is a large amount of samples, it is difficult to distinguish between tumor and normal samples in this plot, also because there are much more control samples than tumor ones. In further steps, it may be necessary to work with a subset in order to obtain clearer results. However, the figure S1.1 reveals substantial differences in sequencing depth between samples and it may be considered discarding those samples whose depth is substantially lower than the rest.

We can not distinguish the type of the samples using the color approach because of the huge number of samples so we should work with a subset to observe clearer results. We can only observe that some samples have lower depth than others. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.

sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
s <- sampledepth[sampledepth > 24.6]
sort(s)
AX.A2HC AX.A2H4 AX.A060 B5.A3S1 AX.A3G7 AJ.A3OK AX.A3FX FL.A1YH EO.A3AS 
   24.7    25.0    25.4    25.7    25.9    26.1    26.2    27.3    28.4 
AX.A3GB JU.AAVI FL.A1YQ B5.A3FC DF.A2KZ AX.A3G1 DF.A2L0 AJ.A3BD AJ.A8CT 
   28.5    28.5    28.8    28.9    28.9    29.0    29.4    29.8    29.9 
B5.A3FH BK.A0CA EY.A1GO DI.A1C3 AJ.A3BH AX.A3FW DI.A2QY EO.A3KU DF.A2KV 
   29.9    30.0    30.0    30.2    30.5    30.5    30.5    31.1    31.3 
DI.A2QY D1.A2G0 PG.A917 AJ.A3EJ B5.A3FB FL.A1YV AX.A3GI B5.A0JR A5.A2K4 
   31.6    31.7    31.7    31.8    31.9    32.2    32.3    32.3    32.7 
E6.A2P8 EO.A3KX EY.A2ON QS.A5YQ DF.A2KU AJ.A3OJ AX.A3G4 EO.A3B1 EY.A2OO 
   32.7    32.8    32.8    33.0    33.2    33.3    33.3    33.3    33.4 
AJ.A3NH AP.A3K1 B5.A3F9 A5.A2K2 D1.A3DH AJ.A3BF B5.A3FA BK.A139 D1.A3DG 
   33.5    33.7    33.7    33.8    33.8    33.9    33.9    34.1    34.3 
KP.A3W1 AJ.A3BG BG.A3PP AX.A1C7 DI.A2QT A5.A2K5 AX.A3G3 EO.A3AZ EO.A3AV 
   34.3    34.4    34.5    34.6    34.6    34.7    34.7    34.7    34.8 
K6.A3WQ DF.A2KR BG.A2AD FL.A1YN 2E.A9G8 AX.A3G8 AX.A0J0 AJ.A5DV EY.A3L3 
   34.8    34.9    34.9    34.9    35.0    35.2    35.3    35.4    35.4 
KP.A3W3 FL.A1YF AX.A3G6 EY.A2OP FI.A3PV EO.A3B0 QS.A5YR B5.A3FD AJ.A2QM 
   35.4    35.5    35.6    35.6    35.6    35.7    35.7    35.8    35.9 
EO.A22U BK.A6W4 AX.A3G9 FI.A3PX EO.A22X AJ.A3NC AJ.A3BK AJ.A8CV BG.A3EW 
   36.0    36.2    36.3    36.4    36.5    36.7    36.9    37.0    37.0 
FL.A1YT AX.A3FT BG.A0MK A5.A2K7 4E.A92E AJ.A3EK AX.A2HC AJ.A8CW EO.A1Y7 
   37.1    37.2    37.3    37.4    37.5    37.5    37.5    37.6    37.6 
EY.A2OQ B5.A5OC BK.A56F EO.A3AU EO.A3AY AJ.A5DW B5.A11R EY.A4KR PG.A5BC 
   37.6    37.9    37.9    37.9    38.0    38.1    38.2    38.2    38.2 
E6.A2P9 EO.A3L0 AX.A2IN AX.A05Y FL.A1YI AJ.A3BI DF.A2KY KP.A3W0 FL.A3WE 
   38.3    38.3    38.7    38.8    38.8    38.9    38.9    38.9    38.9 
DF.A2KN EY.A547 E6.A8L9 EY.A54A D1.A3DA FL.A1YL AJ.A3OL AJ.A3EL A5.AB3J 
   39.0    39.1    39.2    39.2    39.3    39.3    39.5    39.6    39.7 
BG.A3PP BK.A13B QF.A5YT PG.A6IB EY.A1GX EO.A22Y BK.A4ZD EY.A1GL EY.A549 
   39.7    39.8    39.9    40.1    40.2    40.3    40.3    40.5    40.5 
AX.A2HD BK.A139 AJ.A3TW AJ.A3I9 EY.A210 KP.A3VZ A5.A3LP EY.A548 EY.A5W2 
   40.5    41.1    41.4    41.5    41.6    41.7    42.1    42.8    42.8 
SJ.A6ZJ AJ.A3QS EY.A1GP B5.A0K9 AJ.A3NC AJ.A3EM D1.A3JP FL.A1YG AJ.A3NE 
   42.9    43.1    43.1    43.4    43.5    43.8    43.8    44.0    44.1 
BK.A4ZD EY.A72D PG.A916 AJ.A3NE AX.A3FS PG.A914 PG.A915 AX.A2HH AX.A3FV 
   44.3    44.4    44.6    44.7    44.8    44.9    45.2    45.6    45.8 
A5.A1OH AJ.A3NH B5.A1MW B5.A5OD FL.A1YU AJ.A3NF AJ.A2QO AX.A3FZ B5.A1MS 
   46.0    46.0    46.3    46.7    47.0    47.1    47.3    47.3    47.3 
AJ.A3IA D1.A3JQ SL.A6JA KP.A3W4 EY.A3QX PG.A7D5 AJ.A3NG AJ.A23N KJ.A3U4 
   47.9    48.1    48.7    48.8    49.0    49.4    49.5    50.1    50.4 
A5.A3LO 5B.A90C B5.A0JN QF.A5YS SJ.A6ZI EO.A3KW BG.A3EW AJ.A6NU QS.A744 
   51.5    51.8    52.0    53.6    53.6    53.7    54.0    54.2    55.0 
A5.A7WJ BK.A6W3 A5.A7WK SL.A6J9 B5.A5OE 
   55.4    56.9    57.3    58.3    60.1 

3.2 Subsetting

It is known that in a lot of studies when we want to compare tumoral samples with normal ones those samples came from the same patient and that is the reason why we wanted to know if this is our case. We observe that 23 samples fullfil this condition, so that means that from these patients we have a tumor and normal sample. For this reason we can assume that we have paired data.

We can see that the number of paired samples is the same number as samples catalogued both as female and normals of the first preview of our data, so it makes sense.

Paired data, as we mentioned, comes from experiments where the samples are taken from the same subject so in further steps we could have the advantage that we avoid batch effects and we have a balanced set. To be able to know if a tumor sample and a normal one come from the same patient we have used as an identifier the bcr barcode.

frame_patients <- data.frame(table(colData(se)$bcr_patient_barcode))
table(frame_patients$Freq)

  1   2 
522  23 
#As we can observe, from the data we can only obtain 23 paired samples according to the barcode. Let's apply the filter to obtain the subset.
se_x2 <- se[, colData(se)$bcr_patient_barcode  %in% frame_patients$Var1[frame_patients$Freq == 2]] ## %in% operator in R, is used to identify if an element belongs to a vector, in this case, the patients that are found in a frequence of 2.
se_x2
class: RangedSummarizedExperiment 
dim: 20115 46 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(2): counts logCPM
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(46): TCGA.AJ.A2QL.01A.11R.A18M.07
  TCGA.AJ.A3NC.01A.11R.A22K.07 ... TCGA.DI.A2QY.11A.11R.A19W.07
  TCGA.E6.A1M0.11A.11R.A144.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(colData(se_x2)$type)

normal  tumor 
    23     23 

Let’s examine the sequencing depth again using the subset. Figure 2 below shows the sequencing depth per sample:

[1] 1240081311
[1] 26958289
Library sizes in increasing order (paired data).

Figure 2: Library sizes in increasing order (paired data)

Thanks to the subsetting we can see more clearly the graph that shows the depth of coverage for each sample. We can see how the samples are not fully grouped according to their origin (tumor or normal), but they are distributed throughout the graph having high and low coverage values for both normal and tumor samples.

It is necessary to clarify that, in this case, by doing this subsetting, we are ignoring lots of data (the majority), but in cancer studies using paired data allows us to obtain more accurate results, so we will begin with this approach and further analyse the data we exclude if necessary.

Also, usually the mean expression should be another threshold to include or exclude data. We can see 2 thresholds: the bottom one corresponding to the mean expression of the whole dataset, and the one on the top, corresponding to the mean depth of the subset sample. They both cutt-off the subset under the 19 most expressed samples, and it could be considered to filter just those, but for now, we have such a small subset of samples that we are going to mantain all of them, and study below if the low expression affects the analysis, or otherwise keep all these samples.

3.3 Distribution of expression levels among samples

Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figures 3, 4, 5

Non-parametric density distribution of expression profiles per sample.

Figure 3: Non-parametric density distribution of expression profiles per sample

We can see the distribution more clear in the next boxplots:
Non-parametric density distribution of expression profiles per normal sample.

Figure 4: Non-parametric density distribution of expression profiles per normal sample

Non-parametric density distribution of expression profiles per tumor sample.

Figure 5: Non-parametric density distribution of expression profiles per tumor sample

The distribution of expression levels across samples allows one to identify samples with distinctive RNA composition. In this case we can not observe differences in either kind of plot: the density plots have the same behaviour in tumor and normal samples and the boxplots behave more or less the same, we can also see that the behavior between samples is also very similar within the plots.

3.4 Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the paired samples. Figure 6 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 6: Distribution of average expression level per gene

The previous log CPM plot represents the typically bimodal distribution, with a low-CPM peak representing the most lowly-expressed genes and a high-CPM peak representing genes with a high level of expression. To filter out the lowly-expressed genes, we should choose a threshold between both peaks, and shown as a red line, logCPM = 1 was chosen as this threshold.

qq-plot

Figure 7: qq-plot

We can see with the previous QQ-plot that, indeed, with a threshold of 1 we pick only the genes that are not lowly-expressed (not the horizontal, negative values from the left) and that we also exclude the first part of the increasing exponentially values, which will allow us to avoid technical artifacts. So we will perform the subsetting of the genes using this threshold.

3.5 Filtering of lowly-expressed genes

RNA-seq expression profiles from lowly-expressed genes can lead to artifacts in downstream differential expression analyses, as mentioned. For this reason it is common practice to remove them.

And, as stated beforehand, the cuttoff of 1 log CPM unit was chosen, so all the genes with lower expression will not be considered.

[1] 20115    46
[1] 11638    46
Filtering of lowly-expressed genes

Figure 8: Filtering of lowly-expressed genes

In the figure 8 we can see a summary and a visual representation of the genes we kept: just the ones surpassing the cut-off of 1 (a little bit more than half the initial number of genes).

We can store un-normalized versions of the filtered expression data.

saveRDS(se.filt, "results/se.filt.unnorm.rds")
saveRDS(dge.filt, "results/dge.filt.unnorm.rds")

3.6 Normalization

3.6.1 Between

We calculate now the normalization factors on the filtered expression data set. The calcNormFactors is an EdgeR function to perform the between normalization with Trimmed Mean of M-values (TMM).

dge.filt <- calcNormFactors(dge.filt) #EdgeR function to performe the between normalization TMM.

3.6.2 Within

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25) #Within normalization.

Store normalized versions of the filtered expression data.

saveRDS(se.filt, "results/se.filt.rds")
saveRDS(dge.filt, "results/dge.filt.rds")

3.7 MA-plots

MA-plots are useful to compare two group of samples. It concludes how different the two samples are in terms of read counts in RNA-seq experiments. So, we compare one sample at a time against the average of the rest of the samples

We examine now the MA-plots of the normalized expression profiles. Blue line is a reference of what we expect, red line indicates if we are above or below.

We look first to the tumor samples in Figure 9.

MA-plots of the tumor samples.

Figure 9: MA-plots of the tumor samples

We do not observe any sample that is very different from the others and that therefore has a strange behavior, sometimes some samples biases but normally the red line (mean of normalized values) is closer to the blue line (expected line).

Let’s look now to the normal samples in Figure 10.

MA-plots of the tumor samples.

Figure 10: MA-plots of the tumor samples

In this case we can observe that some of the normal samples have lines whome tails are not exactly fitting the expected line: maybe can cause problems in further steps of the analysis.

3.8 Batch identification

Batch effects can occur because measurements are affected by laboratory conditions, reagent lots, and personnel differences. This becomes a major problem when batch effects are confounded with an outcome of interest and lead to incorrect conclusions. So we have to detect the batch effect to know if the results that we’re going to obtain are reliable.

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples. Also we obtained all the codes definition in https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/.

ELEMENTS DATA:

TSS (tissue source site):

tss <- substr(colnames(se.filt), 6, 7)
table(tss) #Tissue source site
tss
AJ AX BG BK DI E6 
 8 20  6  6  4  2 

Center in which the samples are sequenced:

center <- substr(colnames(se.filt), 27, 28)
table(center) #Center in which the samples are sequenced 
center
07 
46 

Plates in which the samples are sequenced:

plate <- substr(colnames(se.filt), 22, 25) #Plates in which the samples are sequenced 
table(plate) 
plate
A00V A104 A109 A118 A137 A144 A16F A17B A18M A19W A22K A27V A32Y 
   1    2    1    3    6    2    2    4    6    2   10    5    2 

Analyte used to sequence:

portionanalyte <- substr(colnames(se.filt), 18, 20) #Analyte used to sequence 
table(portionanalyte)
portionanalyte
11R 12R 21R 22R 32R 33R 
 38   3   2   1   1   1 

Sample type:

samplevial <- substr(colnames(se.filt), 14, 16) #Type to which the samples belong 
table(samplevial)
samplevial
01A 11A 
 23  23 

From this information we can make the following observations:

  • All samples were sequenced at the same center with the 07 code (University of North Carolina)

  • All samples belong to one of two combinations of tissue type and vial, matching the expected tumor and normal distribution. (samplevial). Code definition for 01A : Primary Solid Tumor also called TP Code definition for 11A: Solid Tissue Normal.

  • Samples from Uterine Corpus Endometrial Carcinoma were collected across different tissue source sites (TSS). AJ: International Genomics Conosrtium AX: Gynecologic Oncology Group BG: University of Pittsburgh BK: Christiana Healthcare DI: MD Anderson E6: Roswell Park

  • The samples are sequenced within different plates.

  • The samples were not sequenced using the same analyte combinations.

Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS, plate and portionalyte.

table(data.frame(TYPE=se.filt$type, TSS=tss))
        TSS
TYPE     AJ AX BG BK DI E6
  normal  4 10  3  3  2  1
  tumor   4 10  3  3  2  1
table(data.frame(TYPE=se.filt$type, plate=plate))
        plate
TYPE     A00V A104 A109 A118 A137 A144 A16F A17B A18M A19W A22K A27V A32Y
  normal    0    1    0    1    3    1    1    1    3    1    5    4    2
  tumor     1    1    1    2    3    1    1    3    3    1    5    1    0
table(data.frame(TYPE=se.filt$type, portionanalyte=portionanalyte))
        portionanalyte
TYPE     11R 12R 21R 22R 32R 33R
  normal  20   1   0   1   0   1
  tumor   18   2   2   0   1   0

When we classify the tumor and normal samples according to the TSS we can see that the number of tumor samples and the number of normal samples is the same for each TSS. This indicates that the tumor sample and the normal sample of each patient have been obtained in the same batch. Therefore, in this case the batch effect due to differences in the TSS are not affecting the data and is not a a source of expression variability.

To verify that the the TSS is not causing batch effects it is better to examine how the samples are grouped by hierarchical clustering and multidimensional scaling. We are going to repeat this verification with the other two surrogates.

First, we annotated the outcome of interest and the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts.

logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
batch2 <- as.integer(factor(plate)) 
batch3 <- as.integer(factor(portionanalyte)) 

Next we created the dendongram with the representation of the Hierarchical clustering of the samples for each of the surrogates. The resulting dendrograms are shown in Figures 11, 12, 13.

sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples (TSS surrogate)")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(tss))), fill=sort(unique(batch)))
 Hierarchical clustering of samples (TSS surrogate)

Figure 11: Hierarchical clustering of samples (TSS surrogate)

sampleDendrogram2 <- as.dendrogram(sampleClustering, hang=0.1)
names(batch2) <- colnames(se.filt)
outcome2 <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome2) <- colnames(se.filt)
sampleDendrogram2 <- dendrapply(sampleDendrogram2,
                               function(x, batch2, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch2[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch2, outcome2)
plot(sampleDendrogram2, main="Hierarchical clustering of samples (Plate surrogate)")
legend("topright", paste("Batch", sort(unique(batch2)), levels(factor(plate))), fill=sort(unique(batch2)))
 Hierarchical clustering of samples (Plate surrogate)

Figure 12: Hierarchical clustering of samples (Plate surrogate)

sampleDendrogram3 <- as.dendrogram(sampleClustering, hang=0.1)
names(batch3) <- colnames(se.filt)
outcome3 <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome3) <- colnames(se.filt)
sampleDendrogram3 <- dendrapply(sampleDendrogram3,
                               function(x, batch3, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch3[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch3, outcome3)
plot(sampleDendrogram3, main="Hierarchical clustering of samples (Portionanalyte surrogate)")
legend("topright", paste("Batch", sort(unique(batch3)), levels(factor(portionanalyte))), fill=sort(unique(batch3)))
 Hierarchical clustering of the samples (Portionanalyte surrogate).

Figure 13: Hierarchical clustering of the samples (Portionanalyte surrogate)

One of the main observations that can be seen in the dendrogram is that the samples are grouped according to whether they are tumoral and normal, and therefore there is no batch effect observed due to TSS, plate and portionalyte.

library(RColorBrewer)
type <- as.integer(factor(se.filt$type))

plotMDS(dge.filt, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(tss))),
       fill=sort(unique(batch)), inset=0.05)
 Multidimensional scaling plot of the samples (TSS surrogate).

Figure 14: Multidimensional scaling plot of the samples (TSS surrogate)

plotMDS(dge.filt, labels=outcome, col=batch2)
legend("bottomright", paste("Batch", sort(unique(batch2)), levels(factor(plate))),
       fill=sort(unique(batch2)), inset=0.05)
 Multidimensional scaling plot of the samples (Plate surrogate).

Figure 15: Multidimensional scaling plot of the samples (Plate surrogate)

plotMDS(dge.filt, labels=outcome, col=batch3)
legend("bottomleft", paste("Batch", sort(unique(batch3)), levels(factor(portionanalyte))),
       fill=sort(unique(batch3)), inset=0.05)
 Multidimensional scaling plot of the samples (Portionanalyte surrogate).

Figure 16: Multidimensional scaling plot of the samples (Portionanalyte surrogate)

In the multidimensional plot (Figures 14, 15, 16) we can see how the samples are grouped according to whether they are normal or tumoral. We notice that there is a sample that is separated from the rest, A2HC-tumor. It is clear that it is not clustering with all the other tumor samples, but it is also clear that it is not due to batch effect (in any of three cases studied), so we are going to keep the sample by now, but keeping in mind that a further study of this differential behaviour would be interesting.

4 Differential expression analysis

In the next step we are going to identify changes in gene expression. The analysis will allow us to answer the next question:

“What is the number of genes that change their activity between tumor and control samples?”

And in further studies we will know what genes are the ones that change their activity.

We are going to work with the normalized gene expression data set that we have obtained in the step before.

4.1 Simple Differential expression analysis

We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva.

Surrogate variable analysis (SVA) is a technique that tries to capture sources of heterogeneity in high-throughput profiling data, such as non-biological variability introduced by batch effects.

SVA provides a function to quickly perform F-tests for detecting genes significantly changing between the full and null models. This enables a quick overview of the impact of adjusting for the estimated heterogeneity.

  • We examine first how many genes change across conditions without adjustment. It will be useful in order to compare the model with and without surrogate variables.
library(sva)
mod <- model.matrix(~ se.filt$type, colData(se.filt))
mod0 <- model.matrix(~ 1, colData(se.filt))
pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 5570

There are 5570 genes significantly changing their expression at FDR < 1%. In Figure 17 below we show the distribution of the resulting p-values.

Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 17: Distribution of raw p-values for an F-test on every gene between tumor and normal samples

In the graph we can see the distribution of raw p-values for an F-test on every gene between tumor and normal samples.

A distribution of p -values under the null hypothesis should be uniform. Assuming that most genes are not differentially expressed (DE), the previous histogram should have looked uniform with a peak at low p -values for the truly DE genes. Departures from this indicate data heterogeneity.

  • Now, let’s estimate surrogate variables using the sva() function.
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
sv$n
[1] 11

The SVA algorithm has found 11 surrogate variables. Let’s use them to assess again the extent of differential expression this time adjusting for these surrogate variables.

modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(se.filt)$logCPM, modsv, mod0sv)
sum(p.adjust(pvsv, method="fdr") < 0.01)
[1] 7040

We have increased the number of DE genes (increase of statistical power) to 7040, as we can see, 1470 more genes have been detected as differentially expressed than before (5570 genes). Figure 18 shows the resulting distribution of p-values.

Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 18: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA

After adjusting for surrogate variables using SVA the p -value distribution is more uniform now for the non-DE genes.

4.2 Differential expression analysis based in Fold-change

In this section we can examine gene-by-gene changes in expression by comparing expression values between two groups of samples, tumoral and normal ones.

Calculate log fold-changes between tumoral and normal samples for all genes using the rowMeans() function:

MA-plot

Figure 19: MA-plot

In the plot 19 Normal vs. Tumor, if all genes were equally expressed in both samples, the dots (each dot represents one gene) would map a perfect diagonal line. We could observe, though, that there are dots outside the line. It is normal to have some area over and above the line, but it is clear that there are some points way outside the diagonal. Those one would be the ones we are looking for, to see if they are statistically differentially expressed.

In the left, the dots over the diagonal would be those overexpressed in normal and underexpressed in tumor samples, and the ones under the diagonal, the other way around.

In the right plot, which shows the same idea, as the y axis is a tumor-normal take away, all dots over the horizontal 0 line will be overexpressed in tumor samples and underexpressed in normal ones, and the other way around with the dots far away and under the 0 threshold.

4.3 Volcano Plot

We used the volcano plot (https://en.wikipedia.org/wiki/Volcano_plot_(statistics) in order to plot the raw p-values as function of their fold-changes,both in logarithmic scale, 20.

Volcano plot

Figure 20: Volcano plot

In the previous figure we can see, in red, the differentially overexpressed (to the right) and underexpressed (to the left) genes, although it’s just a first visualization idea, based on the log fold-change approach, and further analysis will be done in the genes to verify this.

4.4 Significance

A first straightforward approach to differential expression analysis is to simply rank genes by decreasing absolute value of log 2 fold changes and call DE those at the top (e.g., the top 10) of the ranking:

log2fc <- tumExp - normalExp
ranking <- order(abs(log2fc), decreasing = TRUE)
head(data.frame(Log2FC = round(log2fc[ranking], digits = 3), FC = round(2^log2fc[ranking], 
    digits = 3), `1/FC` = round(2^(-log2fc[ranking]), digits = 3), row.names = rowData(se.filt)$symbol[ranking], 
    check.names = FALSE), n = 10)
        Log2FC      FC    1/FC
MPLKIP  -9.201   0.002 588.432
ADH1B   -8.513   0.003 365.353
DFFA    -8.454   0.003 350.699
PIK3C2A -8.143   0.004 282.677
RNU5E-1  7.175 144.489   0.007
OR4K5   -6.982   0.008 126.444
MIR4298 -6.919   0.008 121.001
SAPCD2  -6.910   0.008 120.233
CNNM1   -6.679   0.010 102.486
XG      -6.549   0.011  93.632

Example of result: MPLKIP is 588.432 more expressed in tumor samples than in normal ones so we can say that is an overexpressed gene in the cancer that we are studying.

5 DE analysis with linear regression models.

We are going to perform a curated DE analysis using a linear regression analysis, a statistical technique that attempts to explore and model the relationship between two or more variables. We will use the package of limma in order to perform the linear model.

5.0.1 Differential expression with LIMMA

A general workflow with limma (http://bioconductor.org/packages/limma) for RNA-seq data consists of the following steps: 1. Build design matrix: model.matrix() 2. Calculate observational-level weights (if needed): voom() 3. Estimate correlation in repeated measurements of a blocking factor (if needed): duplicateCorrelation() 4. Fit linear model: lmFit() 5. Build contrast matrix (if needed): makeContrasts() 6. Fit contrasts (if needed): contrasts.fit() 7. Calculate moderated t -statistics ( trend=TRUE if needed): eBayes() 8. Output results: topTable()

We are not to follow all the mentioned steps of the general workflow. Below we will mention the process carried out and the results obtained.

5.0.2 Adjust for the mean-variance relationship

RNA-seq counts may vary considerably from sample to sample for the same gene and different samples may be sequenced to different depths. This may lead to identical CPM values for very different count sizes and motivates the need to model the mean-variance trend of log-CPM values at the individual observation level.

Two ways to incorporate the mean-variance relationship into DE analysis with linear models:

  • limma-trend: modify limma’s empirical Bayes procedure to incorporate a mean-variance trend, i.e., in the call to eBayes() , squeezing genewise variances towards a global mean-variance trend, instead of doing it towards a constant pooled variance.
  • limma-voom: calculate precision weights for each gene-by-sample normalized expression value using a new function voom() . Use the resulting matrix of precision weights, jointly with the matrix of expression values, to fit the linear models, i.e., in the call to lmFit() .

A paired design arises from experiments where there are two measurements per individual and thus one speaks of paired measurements. This is also a common setting when each individual already carries the two conditions of interest (e.g., tissues), and therefore, we can exploit this circumstance to adjust for any characteristics that may affect measurements (e.g., sex, age, BMI, genetic background). To set a paired design, the formula that builds the design matrix should include the pair indicator variable.

table(substr(se.filt$bcr_patient_barcode,1,12))

TCGA-AJ-A2QL TCGA-AJ-A3NC TCGA-AJ-A3NE TCGA-AJ-A3NH TCGA-AX-A05Y 
           2            2            2            2            2 
TCGA-AX-A0IZ TCGA-AX-A0J0 TCGA-AX-A1CF TCGA-AX-A1CI TCGA-AX-A1CK 
           2            2            2            2            2 
TCGA-AX-A2H8 TCGA-AX-A2HA TCGA-AX-A2HC TCGA-AX-A2HD TCGA-BG-A2AD 
           2            2            2            2            2 
TCGA-BG-A3EW TCGA-BG-A3PP TCGA-BK-A0CB TCGA-BK-A13C TCGA-BK-A4ZD 
           2            2            2            2            2 
TCGA-DI-A2QU TCGA-DI-A2QY TCGA-E6-A1M0 
           2            2            2 

In the previous table, the colums are the samples (the 23 individuals we have been working with) and the row is the number of samples for individual (corroborating that the model will take 2 samples per individual, the tumor and normal ones).

When building the design matrix by default we could observe that the reference level of the factor variable is normal. We decided to change it to tumor so that positive values of logFC would mean genes overexpressed in cancer:

se.filt$type <- relevel(se.filt$type, ref="tumor")

5.0.3 Adjust for the mean-variance relationship (LIMMA-TREND)

Use limma to build the design matrix, fit the model, calculate the moderate t-statistic and obtain the DE genes.

library(limma) #Allow us to obtain differentially expressed genes using linear regresion models.
#Build the design matrix.
mod<- model.matrix(~type + substr(se.filt$bcr_patient_barcode, 1, 12), data = colData(se.filt))
head(mod, n=3)
                             (Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07           1          0
TCGA.AJ.A3NC.01A.11R.A22K.07           1          0
TCGA.AJ.A3NE.01A.11R.A22K.07           1          0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      1
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      1
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
# substr(se.filt$bcr_patient_barcode, 1, 12) is the same than just bcr_patient_barcode
#Fit the linear model with the CPM values
fit1 <- lmFit(assays(se.filt)$logCPM, mod)
#Calculate moderated t -statistics using the mean-variance trend:
fit1 <- eBayes(fit1, trend = TRUE)
#Examine the extent of differential expression:
FDRcutoff <- 0.01
res1 <- decideTests(fit1, p.value = FDRcutoff)
summary(res1)
       (Intercept) typenormal
Down             4       2819
NotSig        1707       5974
Up            9927       2845
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
Down                                                        0
NotSig                                                  11638
Up                                                          0
dim(res1)
[1] 11638    24

The first block of data represents the coefficients of the model generated (the intercept) and a global summary of what kind of expression we have, compared to tumor type samples. So, around 3000 upstream and downstream regulated genes, and around 6000 for not significantly DE genes.

And the last row is a summary, with our total gene set (around 12000, which fits the 2*3000+6000 summary from the beggining) and 24 columns in the table. Being the first column the block explained before and each of the others a summary for each individual, where the columns is every ID, and the rows are their specific up, down regulated and not significantly expressed genes.

For the genes, we can also take a look at some meta-data, such as the chromosome they are located in:

Add gene meta data:

#Add gene metadata and fetch table of results:
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[,1], stringsAsFactors = FALSE)
fit1$genes <- genesmd
tt1 <- topTable(fit1, coef = 2, n = Inf)
head(tt1)
         chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
663    chr15   BNIP2  5.903993 4.499887  24.21722 3.902416e-19 2.816480e-15
7251   chr11  TSG101  4.115254 5.904267  24.00524 4.840145e-19 2.816480e-15
390197 chr11  OR4D10 -6.216636 3.637634 -22.59390 2.125834e-18 8.246820e-15
144348 chr12  ZNF664 -5.597645 3.454581 -20.22987 3.092343e-17 8.997173e-14
221692  chr6 PHACTR1 -4.735478 1.851592 -19.90986 4.538549e-17 1.056393e-13
55129   chr3   ANO10 -3.765250 3.464546 -19.61277 6.512421e-17 1.263193e-13
              B
663    33.64026
7251   33.43565
390197 32.02226
144348 29.43530
221692 29.06176
55129  28.70961
sort(table(tt1$chr[tt1$adj.P.Val < FDRcutoff]), decreasing = TRUE)

                chr1                chr19                chr11 
                 597                  405                  396 
                chr2                 chr3                chr17 
                 381                  325                  320 
               chr12                 chr7                 chr9 
                 295                  282                  242 
                chr5                 chr6                chr16 
                 241                  240                  233 
                chrX                chr14                chr10 
                 207                  202                  192 
                chr8                 chr4                chr15 
                 188                  182                  176 
               chr20                chr22                chr13 
                 175                  121                  105 
               chr21                chr18                 chrY 
                  78                   71                    6 
 chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt 
                   1                    1                    1 
    chrUn_GL000218v1 
                   1 
DEgenes1 <- rownames(tt1)[tt1$adj.P.Val < FDRcutoff]
DEgenes_symbol1 <- tt1$symbol[tt1$adj.P.Val < FDRcutoff]
length(DEgenes1)
[1] 5664

The importance of knowing in which chromosome the genes are located may be that it could happen that a specific or a few chromosomes are affecting more to this kind of cancer than others. We will take special attention to the chromosomes of the final differentially expressed genes, but, as an overview, we can see that chromosome 1 is the one with most genes involved.

The other important comment about the last table is that there are 12 genes related to Y chromosome, but this samples were all taken from females, so no chromosome Y should had appeared. We will consider it just as a mapping mistake of the reads, but we don’t consider it’s going to affect the analysis, as it’s the chromosome with less genes associated.

And the same happens with the last 4 chromosomes. Some reads were map to those specific chromosomes, that were not correctly identified by the software. In the 3 first cases we can see that the chromosome is annotated, but in the last, not even so. As there is just one gene in each case, we are also not going to consider that as a problem.

Examine diagnostic plots for limma DE analysis with limma-trend in figure 21:
Diagnostic plots for Limma-trend DE analysis

Figure 21: Diagnostic plots for Limma-trend DE analysis

In the previous figure we can visualize the limma-trend results. On the left histogram we can observe a uniformity with a peak at low p-values for the truly DE genes.

In the plot on the right, the black dotted line is formed by all the genes found in this approach that are DE genes.

We are going to perform either the limma-voom approach and for the adjusment of unknown covariates, and compare afterwards the results, to see which one results better for our samples.

5.0.4 Adjust for the mean-variance relationship (LIMMA-VOOM)

Limma-voom 22: calculate precision weights for each gene-by-sample normalized expression value using a new function voom(). Use the resulting matrix of precision weights, jointly with the matrix ofexpression values, to fit the linear models, i.e., in the call to lmFit().

Voom:Mean variance trend

Figure 22: Voom:Mean variance trend

We can observe the mean-variance relation at gene level. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.

Differential expression is then performed as usual by using limma package.

#Fit the linear model using the voom weights:
fit_v <- lmFit(v, mod)
#Calculate the moderated t -statistics:
fit_v <- eBayes(fit_v)

#Examine the extent of differential expression at 1% FDR:
res_v <- decideTests(fit_v, p.value=FDRcutoff)
summary(res_v)
       (Intercept) typenormal
Down             8       2961
NotSig        1871       5971
Up            9759       2706
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down                                                        0
NotSig                                                  11637
Up                                                          1
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
Down                                                        0
NotSig                                                  11638
Up                                                          0
#Addition of gene metadata and fetch table of results:
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[,1], stringsAsFactors = FALSE)
fit_v$genes <- genesmd
tt_v <- topTable(fit_v, coef = 2, n = Inf)
head(tt_v)
         chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
663    chr15   BNIP2  5.839693 4.504789  23.09178 9.879978e-19 1.149832e-14
7251   chr11  TSG101  4.029071 5.905098  22.25970 2.433464e-18 1.416033e-14
390197 chr11  OR4D10 -6.101170 3.648108 -21.51151 5.619566e-18 2.180017e-14
779     chr1 CACNA1S  4.668190 5.161293  20.04532 3.134104e-17 7.294941e-14
221692  chr6 PHACTR1 -4.650689 1.871502 -20.09904 2.937143e-17 7.294941e-14
55129   chr3   ANO10 -3.787187 3.468596 -19.67843 4.903007e-17 9.510199e-14
              B
663    32.56053
7251   31.98948
390197 30.81490
779    29.43487
221692 29.10236
55129  28.87207
DEgenes_v <- rownames(tt_v)[tt_v$adj.P.Val < FDRcutoff]
DEgenes_symbolv <- tt_v$symbol[tt_v$adj.P.Val < FDRcutoff]
length(DEgenes_v)
[1] 5667
sort(table(tt_v$chr[tt_v$adj.P.Val < FDRcutoff]), decreasing = TRUE)

                chr1                chr19                chr11 
                 585                  401                  399 
                chr2                 chr3                chr17 
                 389                  331                  328 
               chr12                 chr7                 chr5 
                 298                  282                  242 
                chr6                 chr9                chr16 
                 239                  237                  229 
                chrX                chr14                chr10 
                 211                  205                  194 
                chr4                 chr8                chr20 
                 189                  183                  175 
               chr15                chr22                chr13 
                 170                  116                  105 
               chr21                chr18                 chrY 
                  77                   73                    5 
 chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt 
                   1                    1                    1 
    chrUn_GL000218v1 
                   1 

After the same previous steps (generating the matrix model), we can now visualize the results.

Examine diagnostic plots for limma DE analysis with voom weights 23:

Diagnostic plots for Limma-voom DE analysis

Figure 23: Diagnostic plots for Limma-voom DE analysis

Same as with the limma-trend approach, we can see, first, a diagram of p-values with a peak for the low ones (differentially expressed) and a more or less constant frequency for the other ones. This means that the approach is good.

In the right plot, we can also observe similar results than in the limma-trend one.

The next step in our differential expression ananlysis consists in correcting the model for unkown covariates, adding them to the model.

5.1 Adjust for unknown covariates

mod<- model.matrix(~type + substr(se.filt$bcr_patient_barcode, 1, 12), data = colData(se.filt))
mod0 <- model.matrix(~ substr(se.filt$bcr_patient_barcode, 1, 12), colData(se.filt))
#estimation  of surrogate variables from the log-CPM values calculated by voom
sv <- sva(v$E, mod = mod, mod0=mod0)
Number of significant surrogate variables is:  7 
Iteration (out of 5 ):1  2  3  4  5  
sv$n 
[1] 7
mod_u <- cbind(mod, sv$sv)
colnames(mod_u) <- c(colnames(mod_u)[1:24], paste0("SV", 1:sv$n))
head(mod_u,n=5)
                             (Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07           1          0
TCGA.AJ.A3NC.01A.11R.A22K.07           1          0
TCGA.AJ.A3NE.01A.11R.A22K.07           1          0
TCGA.AJ.A3NH.01A.11R.A22K.07           1          0
TCGA.AX.A05Y.01A.11R.A00V.07           1          0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      1
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      1
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      1
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      1
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NH.01A.11R.A22K.07                                                      0
TCGA.AX.A05Y.01A.11R.A00V.07                                                      0
                                      SV1         SV2         SV3
TCGA.AJ.A2QL.01A.11R.A18M.07  0.136756052 -0.11169669 -0.17005068
TCGA.AJ.A3NC.01A.11R.A22K.07  0.103196313  0.07265497 -0.10258811
TCGA.AJ.A3NE.01A.11R.A22K.07 -0.008643974  0.14154657 -0.16342623
TCGA.AJ.A3NH.01A.11R.A22K.07  0.029027736 -0.04019069 -0.09629501
TCGA.AX.A05Y.01A.11R.A00V.07  0.090712161  0.04723105  0.21411584
                                     SV4          SV5          SV6
TCGA.AJ.A2QL.01A.11R.A18M.07  0.09681766 -0.346170514  0.003583838
TCGA.AJ.A3NC.01A.11R.A22K.07  0.31508178 -0.006150784  0.091382487
TCGA.AJ.A3NE.01A.11R.A22K.07  0.15806919  0.198620038  0.220746254
TCGA.AJ.A3NH.01A.11R.A22K.07 -0.11195618  0.207936216 -0.132124399
TCGA.AX.A05Y.01A.11R.A00V.07 -0.35021738 -0.281424109  0.058073372
                                      SV7
TCGA.AJ.A2QL.01A.11R.A18M.07 -0.008817287
TCGA.AJ.A3NC.01A.11R.A22K.07 -0.015456258
TCGA.AJ.A3NE.01A.11R.A22K.07 -0.387483932
TCGA.AJ.A3NH.01A.11R.A22K.07  0.135429475
TCGA.AX.A05Y.01A.11R.A00V.07  0.028333391

There are 9 SVs. Second, we add these SVs to the design matrix, which has the following form:

fit_u <- lmFit(v, mod_u)
fit_u <- eBayes(fit_u)

We examine the extent of differential expression at 1% FDR:

res_u <- decideTests(fit_u, p.value = FDRcutoff)
summary(res_u)
       (Intercept) typenormal
Down             1       3371
NotSig        1643       5098
Up            9994       3169
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down                                                        0
NotSig                                                  11637
Up                                                          1
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down                                                        0
NotSig                                                  11637
Up                                                          1
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down                                                        1
NotSig                                                  11637
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down                                                        0
NotSig                                                  11638
Up                                                          0
       substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0   SV1   SV2
Down                                                        0   706   228
NotSig                                                  11638  9968 10965
Up                                                          0   964   445
         SV3   SV4   SV5   SV6   SV7
Down     476    19     2    19     0
NotSig 10590 11605 11636 11609 11638
Up       572    14     0    10     0
#Finally, the metadata is added, the table of results is computed and the number of DE genes is calculated:

fit_u$genes <- genesmd
tt_u <- topTable(fit_u, coef = 2, n = Inf)
saveRDS(tt_u, file.path("results", "tt_u.rds"))
DEgenes_u <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff]
DEgenes_symbol_u <- tt_u$symbol[tt_u$adj.P.Val < FDRcutoff]
length(DEgenes_u)
[1] 6540

Though it may not look significant, the DE genes have globally increased, as there are more or less 3500 up and down regulated genes, and just 5000 not DE.

We can also observe the chromosome distribution of DE genes.

sort(table(tt_u$chr[tt_u$adj.P.Val < FDRcutoff]), decreasing = TRUE)

                chr1                chr19                chr11 
                 682                  466                  452 
                chr2                 chr3                chr17 
                 440                  380                  365 
               chr12                 chr7                 chr6 
                 339                  317                  290 
                chr5                chr16                 chr9 
                 286                  273                  269 
                chrX                chr14                 chr4 
                 244                  239                  226 
               chr10                 chr8                chr20 
                 221                  218                  204 
               chr15                chr22                chr13 
                 193                  142                  121 
               chr18                chr21                 chrY 
                  81                   80                    8 
 chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt 
                   1                    1                    1 
    chrUn_GL000218v1 
                   1 
head(mod, n=3)
                             (Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07           1          0
TCGA.AJ.A3NC.01A.11R.A22K.07           1          0
TCGA.AJ.A3NE.01A.11R.A22K.07           1          0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      1
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      1
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0
                             substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07                                                      0
TCGA.AJ.A3NC.01A.11R.A22K.07                                                      0
TCGA.AJ.A3NE.01A.11R.A22K.07                                                      0

We can also examine the corresponding plots of this model in @ref(fig: diagnostic-plot-Adjust-for-unknown-covariates:

Diagnostic plots for Adjust for unknown covariates

Figure 24: Diagnostic plots for Adjust for unknown covariates

The plots look similar again. In the left figure we see highly differenciated peak for DE genes. And in the right one, we can see black dots 20 quantiles far away from the red line. We are specially interested in the standing alone dots of the extrems.

5.2 Volcano plot of different used models

The so-called volcano plot is a widely used diagnostic plot in DE analysis to assess the extent of DE by plotting the raw p-values as function of their fold-changes, both in logarithmic scale.

As all approaches were more or less the same, we can visualize the most differentially expressed genes in all of them, with a volcano plot. We showed before the figure, but here we can show the specific most DE genes.

Volcano plot for Model 1 25:
Model 1 - Volcano plots of DE

Figure 25: Model 1 - Volcano plots of DE

         chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
663    chr15   BNIP2  5.903993 4.499887  24.21722 3.902416e-19 2.816480e-15
7251   chr11  TSG101  4.115254 5.904267  24.00524 4.840145e-19 2.816480e-15
390197 chr11  OR4D10 -6.216636 3.637634 -22.59390 2.125834e-18 8.246820e-15
144348 chr12  ZNF664 -5.597645 3.454581 -20.22987 3.092343e-17 8.997173e-14
221692  chr6 PHACTR1 -4.735478 1.851592 -19.90986 4.538549e-17 1.056393e-13
              B
663    33.64026
7251   33.43565
390197 32.02226
144348 29.43530
221692 29.06176

Volcano plot for Model 2 26:

Model 2 - Volcano plots of DE

Figure 26: Model 2 - Volcano plots of DE

         chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
663    chr15   BNIP2  5.839693 4.504789  23.09178 9.879978e-19 1.149832e-14
7251   chr11  TSG101  4.029071 5.905098  22.25970 2.433464e-18 1.416033e-14
390197 chr11  OR4D10 -6.101170 3.648108 -21.51151 5.619566e-18 2.180017e-14
779     chr1 CACNA1S  4.668190 5.161293  20.04532 3.134104e-17 7.294941e-14
221692  chr6 PHACTR1 -4.650689 1.871502 -20.09904 2.937143e-17 7.294941e-14
              B
663    32.56053
7251   31.98948
390197 30.81490
779    29.43487
221692 29.10236

Volcano plot for Model 3 27:

Model 3 - Volcano plots of DE

Figure 27: Model 3 - Volcano plots of DE

         chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
23094  chr19 SIPA1L3  6.273280 4.374782  26.42984 2.255357e-16 1.438171e-12
390197 chr11  OR4D10 -6.185478 3.648108 -26.14472 2.753457e-16 1.438171e-12
779     chr1 CACNA1S  4.546686 5.161293  25.72513 3.707263e-16 1.438171e-12
663    chr15   BNIP2  5.965787 4.504789  25.25117 5.216656e-16 1.517786e-12
81626   chr1 SHCBP1L -2.171171 7.764988 -24.61222 8.348164e-16 1.943119e-12
              B
23094  27.35205
390197 27.16273
779    27.14925
663    26.63682
81626  26.44553

The volcano plots for the three models look quite similar, but we can see that the top DE genes are not the same. Nevertheless, it is important to considerate that some top genes do appear in all the figures, which makes them more likely to be correctly identified. For the overexpressed top genes (in the right half of the figures), BNIP2 appears in all of them, while CACNA1S and TSG10 appear in two out of three. On the other hand, for the underexpressed genes, OR4D10 is also robust in all the models.

The previous figure 28 is another way of showing the results for model 3. This time, overexpressed genes are in black over the 0 horizontal threshold, and the underexpressed are below it. The red dots are the non-DE genes.

MA plot for the Model 3

Figure 28: MA plot for the Model 3

As we can observe in the different examination diagnostic plots the results obtained with the different models are similar. However Model 3 (Adjust for unknown covariates) is the one that is going to be used for the Funtional Enrichment Analysis as the outcome of interest is not confounded with other sources of variation. With this method we could detect the highest number of DE genes (6540).

length(DEgenes1)
[1] 5664
length(DEgenes_v)
[1] 5667
length(DEgenes_u)
[1] 6540
saveRDS(DEgenes_u, file.path("results", "DEgenes.rds"))

6 Functional analysis

Given a list of differentially expressed (DE) genes, we may know the function and role of some of those genes within the molecular process under study. This may already shed light on what is being investigated. However, even if we knew what every DE gene is doing, we would need to frame their activity within the pathways in which they are participating to make a hypothesis about why they are changing.

7 The Gene Ontology analysis

There are several R packages at CRAN/Bioconductor that facilitate performing a functional enrichment analysis on the entire collection of GO gene sets. We are going to illustrate this analysis with the Bioconductor package GOstats (http://www.bioconductor.org/packages/release/bioc/html/GOstats.html). Doing this analysis with GOstats (http://www.bioconductor.org/packages/release/bioc/html/GOstats.html) consists of the following three steps: 1. Build a parameter object with information specifying the gene universe, the set of DE genes, the annotation package to use, etc. 2. Run the functional enrichment analysis. 3. Store and visualize the results.

  1. Build a parameter object:
library(GOstats)
Loading required package: Category
Loading required package: Matrix

Attaching package: 'Matrix'
The following object is masked from 'package:S4Vectors':

    expand
Loading required package: graph

Attaching package: 'graph'
The following object is masked from 'package:XML':

    addNode

Attaching package: 'GOstats'
The following object is masked from 'package:AnnotationDbi':

    makeGOGraph
library(org.Hs.eg.db)
#Gene Universe
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11638
#Differential expressed genes:
length(DEgenes_u)
[1] 6540
#Overexpressed genes in tumor samples:
over_DEgenes <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff & tt_u$logFC > 0]
length(over_DEgenes)
[1] 3169
#Underexpressed genes in tumor samples:
under_DEgenes <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff & tt_u$logFC < 0]
length(under_DEgenes)
[1] 3371
params <- new("GOHyperGParams", geneIds=DEgenes_u, universeGeneIds=geneUniverse, annotation="org.Hs.eg.db", ontology="BP",pvalueCutoff=0.05,testDirection="over")

over_params <- new("GOHyperGParams", geneIds=over_DEgenes,universeGeneIds=geneUniverse,annotation="org.Hs.eg.db", ontology="BP", pvalueCutoff=0.05, testDirection="over")

under_params <- new("GOHyperGParams", geneIds=under_DEgenes,universeGeneIds=geneUniverse,annotation="org.Hs.eg.db", ontology="BP", pvalueCutoff=0.05, testDirection="over")

The four previous parameters represent: - Gene universe (N): the total, filtered genes we considered in the DE analysis (11638). - DE genes (m): the result from the previous analysis, the statistically significant DE genes (6540). - The overexpressed genes, a subset of the DE genes (as mentioned, possitive logFC, which means they appear in tumor samples more frequently than in normal ones), with 3169 in total. - The underexpressed genes. Same as before but for negative logFC values. This genes are less frequent in tumor samples (3398).

  1. Run the functional enrichment analysis. Using this method we perform the Fisher analisis and the anotation all together.
conditional(params) <- TRUE
hgOver <- hyperGTest(params)
hgOver
Gene to GO BP Conditional test for over-representation 
12164 GO BP ids tested (187 have p < 0.05)
Selected gene set size: 5329 
    Gene universe size: 9394 
    Annotation package: org.Hs.eg 
  1. Store and visualize the results.
#Store the html
htmlReport(hgOver, file = "gotests.html")
browseURL("gotests.html")

We can see that the gene univers is smaller than before because not all the gens have GO anotations or ids, so the package could not identify or annotate all of them.

After filtering genes without GO id, we can get a first look to the results:

#Visualize the results in R:
goresults <- summary(hgOver)
head(goresults)
      GOBPID       Pvalue OddsRatio  ExpCount Count Size
1 GO:0008203 0.0002070372  2.639067  40.27667    55   71
2 GO:0016331 0.0003739635  2.303836  47.65127    63   84
3 GO:1902653 0.0003778520  4.443019  19.28742    29   34
4 GO:0006066 0.0004892316  1.705489 102.10986   124  180
5 GO:0060627 0.0013217691  1.476800 155.43389   180  274
6 GO:0016126 0.0016370289  3.282021  20.98925    30   37
                                      Term
1            cholesterol metabolic process
2    morphogenesis of embryonic epithelium
3   secondary alcohol biosynthetic process
4                alcohol metabolic process
5 regulation of vesicle-mediated transport
6              sterol biosynthetic process
#To try to spot the more interesting and reliable GO terms we can filter the previous results by a minimum value on the Count and Size columns, a maximum Count value, and order them by the OddsRatio column:

#We used 10 for the counts because the number of count was bigger than 10.
goresults <- goresults[goresults$Size >= 10 & goresults$Size <= 300 & goresults$Count >= 10, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
#Store the html (filtered ones)
dim(goresults)
[1] 117   7
head(goresults)
       GOBPID      Pvalue OddsRatio ExpCount Count Size
51 GO:0032700 0.018333292  7.640534 6.240047    10   11
52 GO:0043162 0.018333292  7.640534 6.240047    10   11
53 GO:0071711 0.018333292  7.640534 6.240047    10   11
54 GO:0099632 0.018333292  7.640534 6.240047    10   11
15 GO:0045540 0.006000840  5.740863 9.639007    15   17
18 GO:0048701 0.006038347  5.734381 9.643709    15   17
                                                                                        Term
51                                          negative regulation of interleukin-17 production
52 ubiquitin-dependent protein catabolic process via the multivesicular body sorting pathway
53                                                            basement membrane organization
54                                                  protein transport within plasma membrane
15                                            regulation of cholesterol biosynthetic process
18                                                  embryonic cranial skeleton morphogenesis

After filtering a minimum count value, we can see that the top results are more related to our disease, or to general cancer processes than the previous one.

Several studies have proved that “IL-17” plays an important role in the pathophysiology of cancer, from tumorigenesis, proliferation, angiogenesis, and metastasis, to adapting the tumour in its ability to confer upon itself both immune, and chemotherapy resistance .

The second GO term represented in the above table that is also called “Ubiquitin-dependent protein catabolic process”" via the MVB pathway involve genes that are important on cancer tumorigenesis and/or progression such as “TSG101” . In previous studies in endometrial cancer patients, aberrant splicing of “TSG101”" gene appeared to be identified more frequently in cancerous than in non-cancerous tissues. Furthermore, functional proteomic analysis of genetically-defined human ovarian cancer models revealed that “TSG101”" is dysregulated in human ovarian epithelial cells expressing oncogenic “HRAS” or “KRAS” .

On the other hand, data from profiling of cancer tissues demonstrate critical contribution of cholesterol metabolism to cancer origin .

It is always interesting to see what genes are involved in the enriched pathways, and compare them to our top previously obtained DE genes. We can do that by accessing the “symbol” term of the GO object.

geneID <- geneIdsByCategory(hgOver)[goresults$GOBPID]
geneSYM <- sapply(geneID, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
set1 <- geneSYM[1:6]
set1 
$`GO:0032700`
 [1] "ARG2"    "IFNG"    "IL12B"   "TNFSF4"  "IL27RA"  "IL36RN"  "FOXP3"  
 [8] "VSIR"    "MIR26A1" "MIR26A2"

$`GO:0043162`
 [1] "TSG101" "HDAC6"  "TOM1L1" "VPS4A"  "RNF115" "VPS28"  "UBAP1"  "RNF126"
 [9] "MVB12B" "VPS37A"

$`GO:0071711`
 [1] "CTSS"   "DAG1"   "GAS2"   "HPN"    "ITGB1"  "SPINT2" "CLASP2" "CLASP1"
 [9] "FERMT1" "PHLDB2"

$`GO:0099632`
 [1] "HIP1"    "RAB8A"   "CPLX1"   "GRIP1"   "NSG1"    "VPS35"   "GRIPAP1"
 [8] "GRIP2"   "SNAP47"  "GRASP"  

$`GO:0045540`
 [1] "ACACB"   "CYP51A1" "DHCR7"   "FDFT1"   "FDPS"    "HMGCR"   "KPNB1"  
 [8] "RAN"     "SP1"     "SQLE"    "SREBF1"  "GGPS1"   "ABCG1"   "GPAM"   
[15] "ELOVL6" 

$`GO:0048701`
 [1] "ALX3"    "RUNX2"   "GNAS"    "SIX1"    "TULP3"   "TWIST1"  "IFT140" 
 [8] "EIF4A3"  "SIX2"    "SLC39A1" "SETD2"   "SLC39A3" "SIX4"    "GRHL2"  
[15] "RDH10"  

We could notice that none of our top 5 DE genes (represented in Figure 27: Model 3 - Volcano plots of DE) are listed in the previous pathways, but there might be 2 reasons for that:

  1. We only visualize the top 5 genes, but over 6000 were considered to find the pathways and they were selected per counts, so the more genes contributing, the more significant the pathway was. Which means it counted more the amount of genes involved in it, rather than the specific contribution of 1.

  2. We only picked 5 genes in order to make the analysis simpler and more bearable when comparing 3 models, so probably if we had increased the number, some of the top genes would appear in the list

Nevertheless, all the pathways are significant in cancer pathogenesis and also there appear highly significant genes, such as the mentioned TSG101 and FOXP3 , so we consider that our approach worked fine.

As we made the distinction of those genes that are overexpressed and underexpressed in tumor samples, we can repeat the process to detect enriched sets taking into account this.

For overexpressed genes:

conditional(over_params) <- TRUE
over_hgOver <- hyperGTest(over_params)
over_goresults <- summary(over_hgOver)
head(over_goresults)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size
1 GO:0071480 3.103515e-05 10.746855   4.081329    12   15
2 GO:1902547 7.728550e-04 10.731554   2.720886     8   10
3 GO:0006066 1.168096e-03  1.641336  48.975942    68  180
4 GO:0035295 2.029701e-03  1.312613 157.267192   188  578
5 GO:2001034 2.299822e-03  9.386426   2.448797     7    9
6 GO:1902036 3.091555e-03  2.365998  12.788163    22   47
                                                                             Term
1                                            cellular response to gamma radiation
2  regulation of cellular response to vascular endothelial growth factor stimulus
3                                                       alcohol metabolic process
4                                                                tube development
5 positive regulation of double-strand break repair via nonhomologous end joining
6                           regulation of hematopoietic stem cell differentiation
over_goresults <- over_goresults[over_goresults$Size >= 10 & over_goresults$Size <= 300 & over_goresults$Count >= 10 , ]
over_goresults <- over_goresults[order(over_goresults$OddsRatio, decreasing=TRUE), ]
head(over_goresults)
       GOBPID       Pvalue OddsRatio ExpCount Count Size
1  GO:0071480 3.103515e-05 10.746855 4.081329    12   15
7  GO:0090630 3.643107e-03  3.579140 5.713860    12   21
29 GO:0090169 1.023404e-02  3.353299 4.897594    10   18
26 GO:0051293 7.863013e-03  3.279590 5.441771    11   20
17 GO:1905508 6.049304e-03  3.220755 5.985948    12   22
9  GO:0045824 4.316477e-03  2.877382 7.890568    15   29
                                                    Term
1                   cellular response to gamma radiation
7                          activation of GTPase activity
29                        regulation of spindle assembly
26                 establishment of spindle localization
17 protein localization to microtubule organizing center
9          negative regulation of innate immune response

For underexpressed genes:

conditional(under_params) <- TRUE
under_hgOver <- hyperGTest(under_params)
under_goresults <- summary(under_hgOver)
head(under_goresults)
      GOBPID       Pvalue OddsRatio ExpCount Count Size
1 GO:0006221 0.0009313583  4.194545 6.494145    14   22
2 GO:0090231 0.0011533687 16.753435 2.361507     7    8
3 GO:0090266 0.0011533687 16.753435 2.361507     7    8
4 GO:0015985 0.0012080726  4.791742 5.313392    12   18
5 GO:0071392 0.0014394037  5.987152 4.132638    10   14
6 GO:0034080 0.0017205027  3.135425 8.855653    17   30
                                                            Term
1                     pyrimidine nucleotide biosynthetic process
2                               regulation of spindle checkpoint
3   regulation of mitotic cell cycle spindle assembly checkpoint
4 energy coupled proton transport, down electrochemical gradient
5                        cellular response to estradiol stimulus
6                          CENP-A containing nucleosome assembly
under_goresults <- under_goresults[under_goresults$Size >= 10 & under_goresults$Size <= 300 & under_goresults$Count >= 10 , ]
under_goresults <- under_goresults[order(under_goresults$OddsRatio, decreasing=TRUE), ]
head(under_goresults)
       GOBPID       Pvalue OddsRatio ExpCount Count Size
5  GO:0071392 0.0014394037  5.987152 4.132638    10   14
4  GO:0015985 0.0012080726  4.791742 5.313392    12   18
12 GO:0034367 0.0031770737  4.788997 4.427826    10   15
10 GO:0045740 0.0027829946  4.390840 5.018203    11   17
11 GO:0048485 0.0027829946  4.390840 5.018203    11   17
1  GO:0006221 0.0009313583  4.194545 6.494145    14   22
                                                             Term
5                         cellular response to estradiol stimulus
4  energy coupled proton transport, down electrochemical gradient
12                          protein-containing complex remodeling
10                         positive regulation of DNA replication
11                         sympathetic nervous system development
1                      pyrimidine nucleotide biosynthetic process

This differenciation of over and under expression allowed us to reafirm the results, as the pathways overexpressed are related to pathogenesis, and the underexpressed are related to damage control and cell cycle activities.

One example is “GO:0071392”: Previous studies have proved that estrogen plays an essential role in endometrial cancer cell proliferation .

8 Concluding remarks

  • A 23 paired data analysis enabled us to avoid batch effect and to have a balanced set being the main source of variation driven by the tumor and normal condition.

  • 6540 genes have been detected as differentially expressed genes using the Adjust for unknown covarites method.

  • The functional enrichment analysis identified GO terms that are related to cancer tumorigenesis and progression.

8.1 Session information

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GO.db_3.8.2                 org.Hs.eg.db_3.8.2         
 [3] GOstats_2.50.0              graph_1.62.0               
 [5] Category_2.50.0             Matrix_1.2-17              
 [7] sva_3.32.1                  genefilter_1.66.0          
 [9] mgcv_1.8-28                 nlme_3.1-139               
[11] geneplotter_1.62.0          annotate_1.62.0            
[13] XML_3.98-1.20               AnnotationDbi_1.46.0       
[15] lattice_0.20-38             edgeR_3.26.4               
[17] limma_3.40.2                SummarizedExperiment_1.14.0
[19] DelayedArray_0.10.0         BiocParallel_1.18.0        
[21] matrixStats_0.54.0          Biobase_2.44.0             
[23] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
[25] IRanges_2.18.1              S4Vectors_0.22.0           
[27] BiocGenerics_0.30.0         knitr_1.23                 
[29] BiocStyle_2.12.0           

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1         xfun_0.7               splines_3.6.0         
 [4] htmltools_0.3.6        yaml_2.2.0             RBGL_1.60.0           
 [7] blob_1.1.1             survival_2.43-3        DBI_1.0.0             
[10] Rgraphviz_2.28.0       bit64_0.9-7            RColorBrewer_1.1-2    
[13] GenomeInfoDbData_1.2.1 stringr_1.4.0          zlibbioc_1.30.0       
[16] codetools_0.2-16       evaluate_0.14          memoise_1.1.0         
[19] GSEABase_1.46.0        Rcpp_1.0.1             xtable_1.8-4          
[22] BiocManager_1.30.4     XVector_0.24.0         bit_1.1-14            
[25] digest_0.6.19          stringi_1.4.3          bookdown_0.11         
[28] grid_3.6.0             tools_3.6.0            bitops_1.0-6          
[31] magrittr_1.5           RCurl_1.95-4.12        RSQLite_2.1.1         
[34] pkgconfig_2.0.2        rmarkdown_1.13         AnnotationForge_1.26.0
[37] compiler_3.6.0        

9 References

  • Ahn SH, Edwards AK, Singh SS, Young SL, Lessey BA, Tayade C. IL-17A Contributes to the Pathogenesis of Endometriosis by Triggering Proinflammatory Cytokines and Angiogenic Growth Factors. J Immunol. 2015;195(6):2591–2600. doi:10.4049/jimmunol.1501138

  • Chang JG, Su TH, Wei HJ, et al. Analysis of TSG101 tumour susceptibility gene transcripts in cervical and endometrial cancers. Br J Cancer. 1999;79(3-4):445–450. doi:10.1038/sj.bjc.6690069

  • Ghanbari Andarieh M, Agajani Delavar M, Moslemi D, Esmaeilzadeh S. Risk Factors for Endometrial Cancer: Results from a Hospital-Based Case-Control Study. Asian Pac J Cancer Prev. ;17(10):4791–4796. Published . doi:10.22034/apjcp.2016.17.10.4791

  • Gorin A, Gabitova L, Astsaturov I. Regulation of cholesterol biosynthesis and cancer signaling. Curr Opin Pharmacol. 2012;12(6):710–716. doi:10.1016/j.coph.2012.06.011

  • Leslie KK, Thiel KW, Goodheart MJ, De Geest K, Jia Y, Yang S. Endometrial cancer. Obstet Gynecol Clin North Am. 2012;39(2):255–268. doi:10.1016/j.ogc.2012.04.001

  • Svoronos N, Perales-Puchalt A, Allegrezza MJ, et al. Tumor Cell-Independent Estrogen Signaling Drives Disease Progression through Mobilization of Myeloid-Derived Suppressor Cells. Cancer Discov. 2017;7(1):72–85. doi:10.1158/2159-8290.CD-16-0502

  • Young T.W., Mei F.C., Rosen D.G., Yang G., Li N., Liu J., Cheng X. Up-regulation of tumor susceptibility gene 101 protein in ovarian carcinomas revealed by proteomics analyses. Mol. Cell. Proteomics. 2007;6:294–304. doi: 10.1074/mcp.M600305-MCP200.